So far…

  • We have been considering main/linear effects
    • What’s the effect of x1 on y? Of x2?
    • Yes, they can change due to collinearity
    • But it’s still a main effect:
    • For a given dataset, we modelled the effect of x as unchanging

Up next…

  • But how might the effect of x change?
    • The effect of x gets higher/lower for higher/lower values of x
    • The effect of x1 gets higher/lower for higher/lower values of x2

Up next…

  • But how might the effect of x change?
    • The effect of x gets higher/lower for higher/lower values of x \(\longrightarrow\) higher-order polynomial
    • The effect of x1 gets higher/lower for higher/lower values of x2 \(\longrightarrow\) interaction

Quadratic (increasing)

Quadratic (decreasing)

Quadratic

\(y \sim b_0 + b_1*x + b_2*x^2\)

Can you think of any reasonable examples?

What’s the harm in ignoring it?

When should I care? THEORY (+ fit)

Model comparison

Try to improve fit, but only if it’s worth it

BIC = one measure of “worth it”; prefers more parsimonious models

mod_linear <- lm(y~x,data_quad)
summary(mod_linear)
## 
## Call:
## lm(formula = y ~ x, data = data_quad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2013 -0.4230 -0.0765  0.5587  4.3390 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.22172    0.18532   1.196    0.234    
## x            1.77471    0.07086  25.044   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.416 on 98 degrees of freedom
## Multiple R-squared:  0.8649, Adjusted R-squared:  0.8635 
## F-statistic: 627.2 on 1 and 98 DF,  p-value: < 2.2e-16

Model comparison

Try to improve fit, but only if it’s worth it

BIC = one measure of “worth it”; prefers more parsimonious models

mod_quadratic <- lm(y~x+I(x^2),data_quad)
summary(mod_quadratic)
## 
## Call:
## lm(formula = y ~ x + I(x^2), data = data_quad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9126 -0.3740 -0.0976  0.3809  5.5373 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.12406    0.17037   0.728    0.468    
## x            2.29383    0.13086  17.529  < 2e-16 ***
## I(x^2)      -0.11373    0.02493  -4.562 1.48e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.292 on 97 degrees of freedom
## Multiple R-squared:  0.8887, Adjusted R-squared:  0.8864 
## F-statistic: 387.4 on 2 and 97 DF,  p-value: < 2.2e-16

Model comparison

##               df      BIC
## mod_linear     3 365.2070
## mod_quadratic  4 350.3721

Model comparison

Interactions

  • What is an interaction?
    • Verbal
    • Visual
    • Math
  • Re-leveling factors/centering variables
  • Power
  • When you should include an interaction

Example without an interaction

  • Imagine you are developing a new feed for salmon farming
  • Let’s say on average it produces a 300g increase in fish size

Example without an interaction

  • But then you remember that male fish are 400g bigger on average
  • Want to check you’ve controlled for that

Example without an interaction

  • So we have 2 main effects:
    • new feed: +300g
    • gender male: +400g
  • Assuming these effects are independent of each other, and given that an average female fish on the regular feed weighs 2000g, what would you expect the average weight to be for:
    • a female on the new feed?
    • a male on the regular feed?
    • a male on the new feed?

Note: this is not about a correlation betw. gender and feed. What would that even mean?

Example without an interaction

  • weight = \(b_0 + b_1*gender + b_2*feed\)
    • (imagine female = 0, male = 1; control=0, new=1)
  • weight = \(2000 + 400*gender + 300*feed\)
mod1 <- lm(weight ~ gender + feed,
           data=fish_data)
coef(mod1)
## (Intercept)  gendermale     feednew 
##   1992.2421    394.4623    320.4373

Example without an interaction

  • What’s the name of this shape?

Example without an interaction

  • What’s the name of this shape?

What is an interaction?

  • Imagine the new feed had a different effect on males and female fish
  • E.g., +400g for females; +200g for males
  • Thus, the effect of one variable depends on the value of another

What is an interaction?

What is an interaction?

What is an interaction?

What is an interaction?

Your turn

  • weight = \(b_0 + b_1*gender + b_2*feed + b_3*gender*feed\)
    • (gender: female = 0, male = 1; feed: control = 0, new = 1)
  • Try map the parameters onto by-group means!

What is an interaction?

Adding in an interaction

What is an interaction?

##        (Intercept)         gendermale            feednew gendermale:feednew 
##          2009.6081           396.0327           400.6258          -215.9386
## # A tibble: 4 x 3
## # Groups:   gender [2]
##   gender feed        m
##   <chr>  <chr>   <dbl>
## 1 female control 2010.
## 2 female new     2410.
## 3 male   control 2406.
## 4 male   new     2590.

What is an interaction?

  • Verbally: Are the effects independent? Or does the strength of one effect depend on the value of the other?
  • Visually: parallelogram or not?
    • Always plot your data!
  • Mathematically: \(b_1x_1 + b_2x_2 + b_3x_1x_2\)

What is an interaction?

  • We had a reason to look for an interaction term (mileage can vary)
    • Knowing gender has an effect on size
    • Wanting to know if the feed works on all salmon
    • Wanting to know whether the mechanism by which the feed increases size has anything to do with gender-variable hormones,
    • etc.
  • Your analyses should be theory driven

Centering/relevelling

What if we exclude the interaction term?

Run a linear regression with just the two main effects (data: fish_data_interaction)

Centering/relevelling

Why did the numbers change?

## 
## Call:
## lm(formula = weight ~ gender + feed, data = fish_data_interaction)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -618.07 -136.60   -1.32  135.55  713.65 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2063.593      8.027  257.10   <2e-16 ***
## gendermale   288.063      9.268   31.08   <2e-16 ***
## feednew      292.656      9.268   31.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 207.2 on 1997 degrees of freedom
## Multiple R-squared:  0.4957, Adjusted R-squared:  0.4952 
## F-statistic: 981.5 on 2 and 1997 DF,  p-value: < 2.2e-16

Centering/relevelling

Why did the numbers change?

Centering/relevelling

Where did my effect go?

New example: you’re testing whether a new “cognitive enhancement” drug speeds up reaction times in a decision making task. Control group gets a placebo; experimental group gets the drug. Pre- and post-treatment measures. Assuming that the drug works:

  • What do you expect to happen to the RT in the control group?
  • What do you expect to happen to the RT in the experimental group?

Centering/relevelling

Where did my effect go?

Centering/relevelling

Where did my effect go?

Does it look like there’s an effect of the drug? Yes!

So do you expect:

  • a significant main effect of drug?
  • a significant interaction term?

Centering/relevelling

Where did my effect go?

  • Model just the main effects (data = drug_data)

Centering/relevelling

Where did my effect go?

  • Model just the main effects (data = drug_data)
## 
## Call:
## lm(formula = rt ~ group + phase, data = drug_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1603.52  -322.26   -13.99   313.68  2061.26 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1406.18      19.16  73.410   <2e-16 ***
## groupexpt           -206.97      22.12  -9.357   <2e-16 ***
## phaseposttreatment  -200.67      22.12  -9.073   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 494.6 on 1997 degrees of freedom
## Multiple R-squared:  0.07839,    Adjusted R-squared:  0.07747 
## F-statistic: 84.93 on 2 and 1997 DF,  p-value: < 2.2e-16

Centering/relevelling

Where did my effect go?

  • Add an interaction term

Centering/relevelling

Where did my effect go?

  • Add an interaction term
## 
## Call:
## lm(formula = rt ~ group * phase, data = drug_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1503.55  -308.91    -1.22   301.64  1961.29 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1306.212     21.667  60.287   <2e-16 ***
## groupexpt                      -7.028     30.641  -0.229    0.819    
## phaseposttreatment             -0.733     30.641  -0.024    0.981    
## groupexpt:phaseposttreatment -399.879     43.333  -9.228   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 484.5 on 1996 degrees of freedom
## Multiple R-squared:  0.1161, Adjusted R-squared:  0.1148 
## F-statistic: 87.39 on 3 and 1996 DF,  p-value: < 2.2e-16

Centering/relevelling

Where did my effect go?

Centering/relevelling

Where did my effect go?

Centering/relevelling

Where did my effect go?

By default:

  • No interaction term: population average effect
  • Interaction term (effect of \(x_1\) depends on \(x_2\)):
    • \(b_1x_1\): when \(x_2=0\)
    • \(b_2x_2\): when \(x_1=0\)
    • \(b_3x_1x_2\): here’s where \(x_1\) and \(x_2\) can vary

Centering/relevelling

Where did my effect go?

So what does 0 mean here?

By default:

  • numeric variable: literal 0
  • categorical variable: first alphabetically (female - male; control - new; baseline - posttreatment; control - expt)

So what can you do?

  • center your numeric variables
  • center/relevel your categorical variables

Centering

Numeric variables

drug_data %>% 
  mutate(rt_scaled = scale(rt)) %>% 
  head
##   id   group    phase        rt   rt_scaled
## 1  1 control baseline 1659.2623  0.88730013
## 2  2 control baseline  620.3499 -1.13026810
## 3  3 control baseline 2134.6126  1.81043075
## 4  4 control baseline  912.6984 -0.56252727
## 5  5 control baseline 1233.8626  0.06117378
## 6  6 control baseline 1290.1337  0.17045225

Centering

Categorical variables

fish_data_interaction %>% 
  mutate(gender_centered = myCenter(gender)) %>% 
  head
##   gender    feed   weight gender_centered
## 1   male control 2358.956             0.5
## 2 female control 2216.687            -0.5
## 3   male control 2616.321             0.5
## 4 female control 1844.046            -0.5
## 5   male control 1898.898             0.5
## 6 female control 2124.016            -0.5

Centering

Categorical variables

## 
## Call:
## lm(formula = rt ~ phase_centered * group_centered, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1503.55  -308.91    -1.22   301.64  1961.29 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1202.36      10.83 110.987   <2e-16 ***
## phase_centered                 -200.67      21.67  -9.262   <2e-16 ***
## group_centered                 -206.97      21.67  -9.552   <2e-16 ***
## phase_centered:group_centered  -399.88      43.33  -9.228   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 484.5 on 1996 degrees of freedom
## Multiple R-squared:  0.1161, Adjusted R-squared:  0.1148 
## F-statistic: 87.39 on 3 and 1996 DF,  p-value: < 2.2e-16

Centering

Do I need to do this?

No, it depends on your research question

  • Fish example: useful to know what the overall effect of gender is
  • Drug example: useful to know that there’s no difference at baseline, no effect of phase on control, and only a change in for experimental group post treatment

Releveling

Changing which is the base level (=0)

Imagine we’d called it “before” and “after” instead of “baseline” and “posttreatment”

Releveling

Changing which is the base level (=0)

Imagine we’d called it “before” and “after” instead of “baseline” and “posttreatment”

## 
## Call:
## lm(formula = rt ~ phase * group, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1503.55  -308.91    -1.22   301.64  1961.29 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1305.479     21.667  60.253   <2e-16 ***
## phasebefore              0.733     30.641   0.024    0.981    
## groupexpt             -406.907     30.641 -13.280   <2e-16 ***
## phasebefore:groupexpt  399.879     43.333   9.228   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 484.5 on 1996 degrees of freedom
## Multiple R-squared:  0.1161, Adjusted R-squared:  0.1148 
## F-statistic: 87.39 on 3 and 1996 DF,  p-value: < 2.2e-16

Releveling

Changing which is the base level (=0)

fish_data_interaction %>% 
  mutate(gender = factor(gender, levels=c("male", "female"))) %>% 
  lm(weight ~ gender,.) %>% 
  summary
## 
## Call:
## lm(formula = weight ~ gender, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -764.40 -178.29   -0.44  171.38  859.98 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2497.984      8.023  311.37   <2e-16 ***
## genderfemale -288.063     11.346  -25.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 253.7 on 1998 degrees of freedom
## Multiple R-squared:  0.2439, Adjusted R-squared:  0.2436 
## F-statistic: 644.6 on 1 and 1998 DF,  p-value: < 2.2e-16

Centering/relevelling

Summary

  • Which you do (if any) depends on what you’re interested in
  • This will change the patterns of significance BUT THAT’S FINE

Power

What should my sample size be if I need to test for an interaction?

  • Big

  • VERY INFORMAL rule of thumb: if you need \(N\) to detect main effect \(b\), might need \(4N\) to detect interaction size \(b\)

  • We’ll worry more about power tomorrow

When should you include an interaction?

  • If, given your best understanding of the relevant literature, it’s plausible that the effect of one variable depends on the value of another
    • e.g., varies by group
  • If you are wanting to say an effect in one group is larger than another group
  • If you are wanting to combine two similar datasets and check if the effect is similar